Assignment 03

Author
Affiliation

Cindy Guzman

Boston University

Published

November 21, 2024

Modified

September 23, 2025

1 Load the Dataset

The instructions below give general keywords for the columns used in the Lightcast file. Consult the data schema generated by the load-dataset code to find the proper column names. For each visualization, customize colors, fonts, and styles to avoid a 2.5-point deduction, and provide a two-sentence explanation describing the key insights drawn from the graph.

  1. Load the Raw Dataset:
     • Use PySpark to load the ‘lightcast_data.csv’ file into a DataFrame.
     • You can reuse the previous code.
     • Copying code from your friend constitutes plagiarism. DO NOT DO THIS.
from pyspark.sql import SparkSession
import pandas as pd
import plotly.express as px
import plotly.io as pio
import numpy as np
import plotly.graph_objects as go
from pyspark.sql.functions import col, split, explode, regexp_replace, transform, when
from pyspark.sql import functions as F
from pyspark.sql.functions import col, monotonically_increasing_id

np.random.seed(42)

pio.renderers.default = "notebook"

# Initialize Spark Session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = spark.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").option("escape", "\"").csv("./data/lightcast_job_postings.csv")

# Show Schema and Sample Data
# print("---This is Diagnostic check, No need to print it in the final doc---")

# df.printSchema() # comment this line when rendering submission
# df.show(5)  
# Histogram of SALARY distribution
from pyspark.sql.functions import floor

binned_df = df.filter(col("SALARY").isNotNull() & (col("SALARY") > 0)) \
              .withColumn("SALARY_BIN", floor(col("SALARY") / 5000) * 5000) \
              .groupBy("SALARY_BIN").count() \
              .orderBy("SALARY_BIN") \
              .toPandas()

fig = px.bar(binned_df, x="SALARY_BIN", y="count", title="Salary Distribution by Bin")
fig.update_layout(xaxis_title="Salary Bin", yaxis_title="Frequency", bargap=0.1)
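The floor-division binning used above can be mirrored in plain pandas to see exactly what `floor(col("SALARY") / 5000) * 5000` does to each value (a minimal sketch; the salary values below are invented for illustration):

```python
import pandas as pd

# Hypothetical salaries, for illustration only
salaries = pd.Series([52000, 54999, 87295, 130042, 49999])

# Same logic as the Spark expression floor(col("SALARY") / 5000) * 5000:
# integer-divide by the bin width, then scale back up
bins = (salaries // 5000) * 5000
print(bins.tolist())  # each salary snapped down to its $5,000 bin floor
```

Every salary in `[50000, 55000)` lands in the 50000 bin, which is what makes the bar chart's x-axis read as bin lower bounds.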

2 Data Preparation

# Step 1: Casting salary and experience columns
df = df.withColumn("SALARY", col("SALARY").cast("float")) \
       .withColumn("SALARY_FROM", col("SALARY_FROM").cast("float")) \
       .withColumn("SALARY_TO", col("SALARY_TO").cast("float")) \
       .withColumn("MIN_YEARS_EXPERIENCE", col("MIN_YEARS_EXPERIENCE").cast("float")) \
       .withColumn("MAX_YEARS_EXPERIENCE", col("MAX_YEARS_EXPERIENCE").cast("float"))


# Step 2: Computing medians for salary columns
def compute_median(sdf, col_name):
    q = sdf.approxQuantile(col_name, [0.5], 0.01)
    return q[0] if q else None

median_from = compute_median(df, "SALARY_FROM")
median_to = compute_median(df, "SALARY_TO")
       
print("Medians:", median_from, median_to)

# Step 3: Imputing missing salaries, but not experience
df = df.fillna({
  "SALARY_FROM": median_from, 
  "SALARY_TO": median_to
  })

# Step 4: Computing average salary
df = df.withColumn("Average_Salary", (col("SALARY_FROM") + col("SALARY_TO")) / 2)

# Step 5: Selecting required columns
export_cols = [
  "EDUCATION_LEVELS_NAME",
  "REMOTE_TYPE_NAME",
  "MAX_YEARS_EXPERIENCE",
  "Average_Salary",
  "LOT_V6_SPECIALIZED_OCCUPATION_NAME"
]
df_selected = df.select(*export_cols)

# Step 6: Saving to CSV
pdf = df_selected.toPandas()
pdf.to_csv("./data/lightcast_cleaned.csv", index=False)

print("Data cleaning complete. Rows retained:", len(pdf))
pdf.head() # Preview the first few rows
Medians: 87295.0 130042.0
Data cleaning complete. Rows retained: 72498
   EDUCATION_LEVELS_NAME          REMOTE_TYPE_NAME  MAX_YEARS_EXPERIENCE  Average_Salary  LOT_V6_SPECIALIZED_OCCUPATION_NAME
0  [\n "Bachelor's degree"\n]     [None]            2.0                   108668.5        General ERP Analyst / Consultant
1  [\n "No Education Listed"\n]   Remote            3.0                   108668.5        Oracle Consultant / Analyst
2  [\n "Bachelor's degree"\n]     [None]            NaN                   108668.5        Data Analyst
3  [\n "No Education Listed"\n]   [None]            NaN                   108668.5        Data Analyst
4  [\n "No Education Listed"\n]   [None]            NaN                   92500.0         Oracle Consultant / Analyst
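The median-imputation step (Steps 2–3 above) can be sanity-checked on a toy DataFrame. This is a minimal pandas sketch with invented values; only the column names come from the script above:

```python
import pandas as pd

# Hypothetical frame with gaps in both salary bounds
toy = pd.DataFrame({
    "SALARY_FROM": [80000.0, None, 90000.0],
    "SALARY_TO":   [None, 120000.0, 140000.0],
})

# Mirror the approach above: fill each column's gaps with that column's median
filled = toy.fillna({
    "SALARY_FROM": toy["SALARY_FROM"].median(),
    "SALARY_TO": toy["SALARY_TO"].median(),
})

assert filled.isna().sum().sum() == 0  # no missing salaries remain
print(filled)
```

Note that Spark's `approxQuantile` returns an approximate median (here with relative error 0.01), so on the full dataset the imputed values may differ slightly from an exact pandas median.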

3 Salary Distribution by Industry and Employment Type

  • Compare salary variations across industries.
  • Filter the dataset:
    • Remove records where salary is missing or zero.
  • Aggregate data:
    • Group by NAICS industry codes.
    • Group by employment type and compute the salary distribution.
    • Calculate salary percentiles (25th, 50th, 75th) for each group.
  • Visualize results:
    • Create a box plot where:
      • X-axis = NAICS2_NAME
      • Y-axis = SALARY_FROM, SALARY_TO, or SALARY
      • Group = EMPLOYMENT_TYPE_NAME
  • Customize colors, fonts, and styles.
  • Explanation: write two sentences about what the graph reveals.
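The filter-and-percentile steps above can be sketched in pandas (a minimal illustration, not the graded solution; the column names follow the Lightcast schema, but the rows are invented):

```python
import pandas as pd

# Hypothetical postings: real column names, invented values
pdf = pd.DataFrame({
    "NAICS2_NAME": ["Finance", "Finance", "Finance", "Retail", "Retail", "Retail"],
    "EMPLOYMENT_TYPE_NAME": ["Full-time"] * 6,
    "SALARY": [90000, 110000, 0, 50000, 70000, None],
})

# Filter: drop records where salary is missing or zero
clean = pdf[pdf["SALARY"].notna() & (pdf["SALARY"] > 0)]

# Aggregate: 25th/50th/75th salary percentiles per industry and employment type
pct = (clean.groupby(["NAICS2_NAME", "EMPLOYMENT_TYPE_NAME"])["SALARY"]
            .quantile([0.25, 0.50, 0.75])
            .unstack())
print(pct)
```

On the Spark side, the same aggregation could use `F.percentile_approx`, and the box plot itself can then be drawn with `px.box(clean, x="NAICS2_NAME", y="SALARY", color="EMPLOYMENT_TYPE_NAME")`.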

4 Salary Analysis by ONET Occupation Type (Bubble Chart)

  • Analyze how salaries differ across ONET occupation types.
  • Aggregate data:
    • Compute the median salary for each occupation in the ONET taxonomy.
  • Visualize results:
    • Create a bubble chart where:
      • X-axis = ONET_NAME
      • Y-axis = `